2023
Visual Instruction Tuning
LLaVA paper: aligns LLMs with visual information through instruction tuning on image-text pairs, enabling multimodal understanding and reasoning.
BLIP-2 leverages frozen image encoders and LLMs for efficient vision-language pre-training, achieving state-of-the-art multimodal performance.
CLIP explained: contrastive learning on 400M image-text pairs enables zero-shot image classification and powerful vision-language understanding.
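CLIP's contrastive objective can be sketched as a symmetric InfoNCE loss over a batch of paired image and text embeddings: matched pairs sit on the diagonal of a cosine-similarity matrix, and cross-entropy is applied in both the image-to-text and text-to-image directions. This is a minimal NumPy illustration under assumed names (`clip_contrastive_loss`, a fixed `temperature` of 0.07), not the original implementation:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of matched image/text pairs.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N); matched pairs on the diagonal
    labels = np.arange(len(img))

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Perfectly aligned embeddings drive the loss toward zero, while mismatched pairs are penalized, which is what lets the learned embedding space be reused for zero-shot classification by comparing an image against text prompts.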